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Abstract 

The  least  squares  (LS)  minimization  problem  constitutes  the  core  of  many  real-time  signal 
processing  problems,  such  as  adaptive  filtering,  system  identification  and  adaptive  beamform¬ 
ing.  Recently  efficient  implementations  of  the  recursive  least  squares  (RLS)  algorithm  and  the 
constrained  recursive  least  squares  (CRLS)  algorithm  based  on  the  numerically  stable  QR  de¬ 
composition  (QRD)  have  been  of  great  interest.  Several  papers  have  proposed  modifications  to 
the  rotation  algorithm  that  circumvent  the  square  root  operations  and  minimize  the  number  of 
divisions  that  are  involved  in  the  Givens  rotation.  It  has  also  been  shown  that  all  the  known 
square  root  free  algorithms  are  instances  of  one  parametric  algorithm.  Recently,  a  square  root 
free  and  division  free  algorithm  has  been  proposed  [4] . 

In  this  paper,  we  propose  a  family  of  square  root  and  division  free  algorithms  and  examine 
its  relationship  with  the  square  root  free  parametric  family.  We  choose  a  specific  instance 
for  each  one  of  the  two  parametric  algorithms  and  make  a  comparative  study  of  the  systolic 
structures  based  on  these  two  instances,  as  well  as  the  standard  Givens  rotation.  We  consider 
the  architectures  for  both  the  optimal  residual  computation  and  the  optimal  weight  vector 
extraction. 

The  dynamic  range  of  the  newly  proposed  algorithm  for  QRD- RLS  optimal  residual  com¬ 
putation  and  the  wordlength  lower  bounds  that  guarantee  no  overflow  are  presented.  The 
numerical  stability  of  the  algorithm  is  also  considered.  A  number  of  obscure  points  relevant  to 
the  realization  of  the  QRD-RLS  and  the  QRD-CRLS  algorithms  are  clarified.  Some  systolic 
structures  that  are  described  in  this  paper  are  very  promising,  since  they  require  less  compu¬ 
tational  complexity  (in  various  aspects)  than  the  structures  known  to  date  and  they  make  the 
VLSI  implementation  easier. 

SP  EDICS: 

5.2.  Algorithms  and  Application  Mappings 

5.1.  Architectures  and  VLSI  Hardware 


This  work  was  supported  in  part  by  the  ONR  grant  N00014-93-1-0566. 
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1  Introduction 


The  least  squares  (LS)  minimization  problem  constitutes  the  core  of  many  real-time  signal  process¬ 
ing  problems,  such  as  adaptive  filtering,  system  identification  and  beamforming  [6].  There  are  two 
common  variations  of  the  LS  problem  for  adaptive  signal  processing: 

1.  Solve  the  minimization  problem 

w(n)  =  argmin  ||  B(n)(X(n)w(n)  -  y(n ))  ||2,  (1) 

u/(n) 

where  X(n)  is  a  matrix  of  size  n  x  p,  w(n )  is  a  vector  of  length  p,  y(n )  is  a  vector  of  length 
n  and  B(n)  —  diag {/3n~1,/3ri~2,  •  •  •,  1},  0  <  /3  <  1,  that  is,  [3  is  a  forgetting  factor. 

2.  Solve  the  minimization  problem  in  (1)  subject  to  the  linear  constraints 

clTw(n)  =  r\i  =  1,2,-  --,N,  (2) 

where  c*  is  a  vector  of  length  p  and  r*  is  a  scalar.  In  this  paper,  we  consider  only  the  special 
case  for  which  y(n)  =  0  for  all  n. 

There  are  two  different  pieces  of  information  that  may  be  required  as  the  result  of  this  minimiza¬ 
tion  [6]: 

1.  The  optimizing  weight  vector  w(n)  and/or 

2.  the  optimal  residual  at  the  time  instant  n: 

e(f„)  =  X(tn)u>(n)  -  y(tn),  (3) 

where  X(tn)  is  the  last  row  of  the  matrix  X(n)  and  y(tn)  is  the  last  element  of  the  vector 

y(n). 

Efficient  implementations  of  the  recursive  least  squares  (RLS)  algorithms  and  the  constrained 
recursive  least  squares  (CRLS)  algorithms  based  on  the  QR  decomposition  (QRD)  were  first  intro¬ 
duced  by  McWhirter  [14],  [15].  A  comprehensive  description  of  the  algorithms  and  the  architectural 
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implementations  of  these  algorithms  is  given  in  [6,  chap. 14].  It  has  been  proved  that  the  QRD- 
based  algorithms  have  good  numerical  properties  [6].  However,  they  are  not  very  appropriate  for 
VLSI  implementation,  because  of  the  square  root  and  the  division  operations  that  are  involved  in 
the  Givens  rotation  and  the  back-substitution  required  for  the  case  of  weight  extraction. 

Several  papers  have  proposed  modifications  in  order  to  reduce  the  computational  load  involved 
in  the  original  Givens  rotation  [2,  5,  4,  8].  These  rotation-based  algorithms  are  not  rotations  any 
more,  since  they  do  not  exhibit  the  normalization  property  of  the  Givens  rotation.  Nevertheless, 
they  can  substitute  for  the  Givens  rotation  as  the  building  block  of  the  QRD  algorithm  and  thus 
they  can  be  treated  as  rotation  algorithms  in  a  wider  sense: 

Definition  1  A  Givens-rotation-based  algorithm  that  can  be  used  as  the  building  block  of  the  QRD 
algorithm  will  be  called  a  -Rotation  algorithm. 

A  number  of  square-root-free  Rotations  have  appeared  in  the  literature  [2],  [5],  [8],  [10].  It  has  been 
shown  that  a  square-root-free  and  division-free  Rotation  does  exist  [4].  Recently,  a  parametric 
family  of  square-root-free  Rotation  algorithms  was  proposed  in  [8];  it  was  also  shown  that  all  the 
known  square- root-free  Rotation  algorithms  belong  to  this  family,  which  is  called  the  ’’^zz-family”. 
In  this  paper  we  will  refer  to  the  /u'-family  of  Rotation  algorithms  with  the  name  parametric  pp 
\ Rotation .  We  will  also  say  that  a  Rotation  algorithm  is  a  pp  Rotation  if  this  algorithm  belongs 
to  the  pp- family.  Several  QRD-based  algorithms  have  made  use  of  these  Rotation  algorithms. 
McWhirter  has  been  able  to  compute  the  optimal  residual  of  the  RLS  algorithm  without  square 
root  operations  [14].  He  also  employed  an  argument  for  the  similarity  of  the  RLS  and  the  CRLS 
algorithms  to  obtain  a  square-root-free  computation  for  the  optimal  residual  of  the  CRLS  algo¬ 
rithm  [15].  A  fully-pipelined  structure  for  weight  extraction  that  circumvents  the  back-substitution 
divisions  was  also  derived  independently  in  [17]  and  in  [19].  Finally,  an  algorithm  for  computing 
the  RLS  optimal  residual  based  on  the  parametric  pp  Rotation  was  derived  in  [8]. 

In  this  paper,  we  introduce  a  parametric  family  of  square-root-free  and  division-free  Rotations. 
We  will  refer  to  this  family  of  algorithms  with  the  name  parametric  kX  Rotation.  We  will  also 
say  that  a  Rotation  algorithm  is  a  kX  Rotation  if  this  algorithm  is  obtained  by  the  parametric  kX 
Rotation  with  a  choice  of  specific  values  for  the  parameters  k  and  A.  We  employ  the  arguments 
in  [8],  [14],  [15]  and  [17]  in  order  to  design  novel  architectures  for  the  RLS  and  the  CRLS  algorithms 
that  have  less  computation  and  circuit  complexity.  Some  systolic  structures  that  are  described  here 
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are  very  promising,  since  they  require  the  minimum  computational  complexity  (in  various  aspects) 
known  to  date,  and  they  can  be  easily  implemented  in  VLSI. 

In  Section  2,  we  introduce  the  parametric  kX  -Rotation.  In  Section  3,  we  derive  the  RLS 
algorithms  that  are  based  on  the  parametric  kX  Rotation  and  we  consider  the  architectural  imple¬ 
mentations  for  a  specific  kX  Rotation.  In  Section  4,  we  follow  the  same  procedure  for  the  CRLS 
algorithms.  In  Section  5,  we  address  the  issues  of  dynamic  range,  lower  bounds  for  the  wordlength, 
stability  and  error  bounds.  We  conclude  with  Section  6.  In  the  Appendix  we  give  the  proofs  of 
some  lemmas  that  are  stated  in  the  course  of  the  paper. 


2  Square  Root  and  Division  free  Algorithms 

In  this  Section,  we  introduce  a  new  parametric  family  of  Givens- rotation-based  algorithms  that  re¬ 
quire  neither  square  root  nor  division  operations.  This  modification  to  the  Givens  rotation  provides 
a  better  insight  on  the  computational  complexity  optimization  issues  of  the  QR  decomposition  and 
makes  the  VLSI  implementation  easier. 


2.1  The  Parametric  k X  Rotation 


The  standard  Givens  rotation  operates  (for  real-valued  data)  as  follows: 


where 


I _ 

r2  ■ 

o 

i 

x'2  • 

xm 

(3r-i 

0r2  ■ 

1 - 

£ 

S3. 

Xl 

x2  ■ 

_  X\ 

\J 02r\  +  xl  y/P2rl  +  xj 

r'\  =  \JP2r\  +  x\ 


r'  =  cf3rj  +  sxj,  j  =  1, 2,  •  •  • ,  m 


x'j  —  —s/3rj  +  cxj,  j  =  2,3,---,m  . 


(4) 

(5) 

(6) 

(7) 

(8) 


3 


We  introduce  the  following  data  transformation: 


-  ji-aj,  Xj  -  J^bj,  r'  _  fij-a'j,  J  —  1,2,---, 


—6' 


m 


j  =  2,3,  •  --,m. 


(9) 


We  seek  the  square  root  and  division-free  expressions  for  the  transformed  data  a'- ,  j  =  1,2,  b' , 

j  =  2, 3,  •  •  • ,  m,  in  (6)  and  solving  for  a^,  we  get 


ai  = 


\  +  /a6?). 

*’a',b 


By  substituting  (5)  and  (9)  in  (7)  and  (8)  and  solving  for  a'-  and  b'-,  we  get 


a'j  = 


lbf32  aid  j  +  labibj 


y/lalb(lbp2aj  +  /a6?)/// 
We  will  let  and  l[  be  equal  to 


and  b'A  = 


-bx/3dj  +  fla-ibj 

3  sj(h{32a\  +  lab\)ll  'b 


j  =  2,3, 


(10) 


(11) 


=  /a  W2a2  +  /a&2)«2,  /'  =  (4/J2g2  +  /a62)A2, 


(12) 


where  n  and  A  are  two  parameters.  By  substituting  (12)  in  (10)-(11)  we  obtain  the  expressions 

a[  =  n(lb(32a\  +  lab\)  (13) 

a'j  =  K(lbf32  d\dj  +  labibj),  j  =  2, and  (14) 

bj  =  Xfii-hdj  +  a-ibj),  j  =  2, 3,  •  •  • ,  m.  (15) 

If  the  evaluation  of  the  parameters  k  and  A  does  not  involve  any  square  root  or  division  operations, 
the  update  equations  (12)-(15)  will  be  square  root  and  division-free.  In  other  words,  every  such 
choice  of  the  parameters  n  and  A  specifies  a  square  root  and  division-free  dotation  algorithm. 

Definition  2  Equations  (12)-(15)  specify  the  parametric  kA  dotation  algorithm.  Furthermore,  a 
Rotation  algorithm  will  be  called  a  kX  dotation  if  it  is  specified  by  (12)- (15)  for  specific  square- 
root-free  and  division-free  expressions  of  the  parameters  k  and  A. 
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One  can  easily  verify  that  the  only  one  square  root  and  division-free  dotation  in  the  literature  to 
date  [4]  is  a  kA  dotation  and  is  obtained  by  choosing  k  =  A  =  1. 


2.2  The  Relation  between  the  Parametric  kX  and  the  Parametric  pu  dotation 

Let 


ka  —  !  ■>  kb  —  j  ,  ka  —  v  ,  kb  —  , . 

Ik 


L 


W 


(16) 


We  can  express  k'a  and  k[  in  terms  of  ka  and  kb  as  follows  [8]: 


K  -  (kap2a\  +  kbbfj  /  p2 ,  k'b  = 


kakb  1 
H2u2  k'a ' 


(17) 


If  we  substitute  (16)  and  (17)  in  (12)  and  solve  for  p  and  v  we  obtain 


K(kaf52a\  +  khb\) 

H  = - — - v  =  A. 


kakb 


The  above  provides  a  proof  for  the  following  Lemma: 


(18) 


Lemma  2.1  For  each  square  root  and  division-free  pair  of  parameters  (k,  A)  that  specifies  a  kX 
Rotation  algorithm  Al,  we  can  find  square-root-free  parameters  (p(k),u( A))  with  two  properties: 
first,  the  pair  (p(k),u(X))  specifies  a  pu  Rotation  algorithm  A2  and  second,  both  Al  and  .42  are 
mathematically  equivalent1 .  □ 

Consequently,  the  set  of  kA  dotation  algorithms  can  be  thought  of  as  a  subset  of  the  set  of  the  pu 
dotations.  Furthermore,  (18)  provides  a  means  of  mapping  a  kX  dotation  onto  a  pu  dotation.  For 
example,  one  can  verify  that  the  square  root  and  division-free  algorithm  in  [4]  is  a  pu  Kotcition  and 
is  obtained  for 

ka(32a\  +  hb\ 

" =  Kh  '  "  =  1- 

In  Fig.  1,  we  draw  a  graph  that  summarizes  the  relations  among  the  classes  of  algorithms  based  on 
QR  decomposition,  a  dotation  algorithm,  a  pu  dotation  and  a  K.X  dotation. 


1  They  evaluate  logically  equivalent  equations. 
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3  RLS  Algorithm  and  Architecture 

In  this  Section,  we  consider  the  kX  dotation  for  optimal  residual  and  weight  extraction  using  systolic 
array  implementation.  Detailed  comparisons  with  existing  approaches  are  presented. 


3.1  A  Novel  Fast  Algorithm  for  the  RLS  Optimal  Residual  Computation 

The  QR-decomposition  of  the  data  at  time  instant  n  is  as  follows: 


R(n)  u(n) 
0r  v(tn) 


T(n) 


0R(n  -  1)  f3u(n  -  1) 

X(tn)  y(tn) 


(19) 


where  T(n)  is  a  unitary  matrix  of  size  (p+l)x(p+l)  that  performs  a  sequence  of  p  Givens 
rotations.  This  can  be  written  symbolically  as 


- 1 

N~’T 

o 

_ 1 

/ 3R(n )  (3u(n) 

0T  lq(n)  2 

X(tn)  y{tn) 

where 


and 


T(n) 

L(n  +  1)  2 

0 

(3R(n  +  1)  (3u(n  +  1) 

0T 

lq(n  +  1 )  2 

- 1 

St 

o 

_ I 

lq(n)-iX{tn)  =  X(tn),  lg{n)~5 y(tn)  =  y(tn) 


L(n)  =diag{h,/2,  •  •  -,/p} 


R(n)  = 


u(n)  —  ,p+-i  ®i,p+i 

[X(t„)  y(tn)\  =  [h  b2 


L(n  +  1)  =diag ■  ■  ■  ,1'  }, 


an  a  12  • 

Clip 

a'\  1  a12  ' 

•  aip 

022  • 

•  a2p 

R(n  +  1)  = 

a22  ' 

•  a2p 

dpp 

app 

®p,p+i]  u(n  T  1) 

•  •  bp  . 


a'\}P+\  gi,p+i 


a 


P,p+ 1 


(20) 


(21) 
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Equations  (12)-(15)  imply  that  the  dotation  is  specified  as  follows: 

l'i  =  S,_1)/324  +  Ubt^Wi  (22) 

l^  =  {l^-l)H2al  +  kb{rl)2)\l  (23) 

a'ij  =  Ki(ll'~1)P2auaij  +  j  =  »,  i  +  1,  •  •  -,p  +  1  (24) 

h(!)  =  Ai/3(-6f_1)aty  +  j  =  i  +  1,  i  +  2,  •  •  -,p  +  1,  (25) 


where  i  —  1,2,-  •  ■  ,p  ,  =  bj,  j  =  1  and  =  lq.  For  the  optimal  residual  we  have: 

Lemma  3.1  If  the  parametric  k\  dotation  is  used  in  the  QRD-RLS  algorithm,  the  optimal  residual 
is  given  by  the  expression 

eRLs(tn)  =  -  [  n  *iPaii  J  K^a,PPb^hv,  (26) 

\i= l  /  Avapp 

where  v  =  1/  yJTq  if  p  is  an  even  number  and  v  =  ^JTq  if  p  is  an  odd  number.  □ 

The  proof  is  given  in  the  Appendix. 

Here,  lq  is  a  free  variable.  If  we  choose  lq  —  1  we  get  v  =  1  for  both  even  and  odd  values  of  p  and 
we  can  avoid  the  square  root  operation.  We  can  see  that  for  a  recursive  computation  of  (26)  only 
one  division  operation  is  needed  at  the  last  step  of  the  recursion.  This  compares  very  favorably 
with  the  square  root  free  fast  algorithms  that  require  one  division  for  every  recursion  step,  as  well 
as  with  the  original  approach  (62),  which  involves  one  division  and  one  square  root  operation  for 
every  recursion  step. 

The  division  operation  in  (26)  cannot  be  avoided  by  proper  choice  of  expressions  for  the  pa¬ 
rameters  k  and  A.  This  is  restated  by  the  following  Lemma,  which  is  proved  in  the  Appendix: 

Lemma  3.2  If  a  nX  Rotation  is  used,  the  RLS  optimal  residual  evaluation  will  require  at  least  one 
division  evaluation.  □ 

Note  that  the  proper  choice  of  the  expression  for  the  parameter  Ap,  along  with  the  rest  of  the 

parameters,  is  an  open  question,  since  the  minimization  of  the  multiplication  operations,  as  well  as 

communication  and  stability  issues  have  to  be  considered. 
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3.2  A  Systolic  Architecture  for  the  Optimal  RLS  Residual  Evaluation 

McWhirter  has  used  a  systolic  architecture  for  the  implementation  of  the  QR  decomposition  [14]. 
This  architecture  is  modified,  so  that  equations  (22)-(26)  be  evaluated  for  the  special  case  of  k;  = 
A,-  =  1,2  —  1,2 -,p  and  lq  —  1.  The  systolic  array,  as  well  as  the  memory  and  the  communication 
links  of  its  components,  are  depicted  in  Fig.  2  2.  The  boundary  cells  (cell  number  1)  are  responsible 
for  evaluating  (22)  and  (23),  as  well  as  the  coefficients  c,-  =  lq~^an  and  s,-  =  and  the 

partial  products  e,-  =  The  internal  cells  (cell  number  2)  are  responsible  for  evaluating 

(24)  and  (25).  Finally,  the  output  cell  (cell  number  3)  evaluates  (26).  The  functionality  of  each 
one  of  the  cells  is  described  in  Fig.  2.  We  will  call  this  systolic  array  51.1. 

On  Table  1,  we  collect  some  features  of  the  systolic  structure  51.1  and  the  two  structures,  51.2 
and  51.3,  in  [14]  that  are  pertinent  to  the  circuit  complexity.  The  51.2  implements  the  square- root- 
free  QRD-RLS  algorithm  with  p  =  v  —  1,  while  51.3  is  the  systolic  implementation  based  on  the 
original  Givens  rotation.  In  Table  1,  the  complexity  per  processor  cell  and  the  number  of  required 
processor  cells  are  indicated  for  each  one  of  the  three  different  cells  3.  One  can  easily  observe 
that  51.1  requires  only  one  division  operator  and  no  square  root  operator,  51.2  requires  p  division 
operators  and  no  square  root  operator,  while  51.3  requires  p  division  and  p  square  root  operators. 
This  reduction  of  the  complexity  in  terms  of  division  and  square  root  operators  is  penalized  with 
the  increase  of  the  number  of  the  multiplications  and  the  communication  links  that  are  required. 

Apart  from  the  circuit  complexity  that  is  involved  in  the  implementation  of  the  systolic  struc¬ 
tures,  another  feature  of  the  computational  complexity  is  the  number  of  operations-per-cycle.  This 
number  determines  the  minimum  required  delay  between  two  consecutive  sets  of  input  data.  For 
the  structures  51.2  and  51.3  the  boundary  cell  (cell  number  1)  constitutes  the  bottleneck  of  the 
computation  and  therefore  it  determines  the  operations-per-cycle  that  are  shown  on  Table  5.  For 
the  structure  51.1  either  the  boundary  cell  or  the  output  cell  are  the  bottleneck  of  the  computation. 

2Note  the  aliases:  =  <rm,  =  <r0utj,  =  far,  =  r,  ^  =  bout, 

et— 1  =  CinjCi  =  tout' 

3The  multiplications  with  the  constants  fi  and  d2  are  not  encountered. 
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3.3  A  Systolic  Architecture  for  the  Optimal  RLS  Weight  Extraction 


Shepherd  et  al.  [17]  and  Tang  et  al.  [19]  have  independently  shown  that  the  optimal  weight  vector 
can  be  evaluated  in  a  recursive  way.  More  specifically,  one  can  compute  recursively  the  term 


R  H(n)  by 


R~H{n) 

# 


=  T(n) 


lR-H{n-l) 


(27) 


and  then  use  parallel  multiplication  for  computing  wT(n )  by 


wT(n)  —  uT(n )  ^ R  H(n)j  . 


(28) 


The  symbol  #  denotes  a  term  of  no  interest.  The  above  algorithm  can  be  implemented  by  a  fully 
pipelined  systolic  array  that  can  operate  in  two  distinct  modes,  0  and  1.  The  initialization  phase 
consists  of  2 p  steps  for  each  processor.  During  the  first  p  steps  the  processors  operate  in  mode  0 
in  order  to  calculate  a  full  rank  matrix  R.  During  the  following  p  steps,  the  processors  operate  in 
mode  1  in  order  to  compute  R~H ,  by  performing  a  task  equivalent  to  forward  substitution.  After 
the  initialization  phase  the  processors  operate  in  mode  0.  In  [17]  one  can  find  the  systolic  array 
implementations  based  both  on  the  original  Givens  rotation  and  the  Gentleman’s  variation  of  the 
square-root-free  dotation,  that  is,  the  pu  dotation  for  p  =  u  =  1.  We  will  call  these  two  structures 
52.3  and  52.2  respectively. 

In  Fig.  3,  we  present  the  systolic  structure  52.1  based  on  the  nX  dotation  with  k;  =  A,-  =  1,7  = 
1,2 This  is  a  square- root-free  and  division-free  implementation.  The  boundary  cells  (cell 
number  1)  are  slightly  simpler  than  the  corresponding  ones  of  the  array  51.1.  More  specifically, 
they  do  not  compute  the  partial  products  e;.  The  internal  cells  (cell  number  2),  that  compute  the 
elements  of  the  matrix  R,  are  identical  to  the  corresponding  ones  of  the  array  51.1.  The  cells  that 
are  responsible  for  computing  the  vector  u  (cell  number  3)  differ  from  the  other  internal  cells  only 
in  the  fact  that  they  communicate  their  memory  value  with  their  right  neighbors.  The  latter  (cell 
number  4)  are  responsible  for  evaluating  (28)  and  (27).  The  functionality  of  the  processing  cells, 
as  well  as  their  communication  links  and  their  memory  contents,  are  given  in  Fig.  3.  The  mode  of 
operation  of  each  cell  is  controlled  by  the  mode  bit  provided  from  the  input.  For  a  more  detailed 
description  of  the  operation  of  the  mode  bit  one  can  see  [15]  and  [17]. 
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On  Tables  2  and  5,  we  collect  some  computational  complexity  metrics  for  the  systolic  arrays 
52.1,  52.2  and  52.3,  when  they  operate  in  mode  04.  The  conclusions  we  can  draw  are  similar  to 
the  ones  we  had  for  the  circuits  that  calculate  the  optimal  residual:  the  square  root  operations  and 
the  division  operations  can  be  eliminated  with  the  cost  of  an  increased  number  of  multiplication 
operations  and  communication  links.  We  should  also  note  that  52.1  does  require  the  implementation 
of  division  operators  in  the  boundary  cells,  since  these  operators  are  used  during  the  initialization 
phase.  Nevertheless,  after  the  initialization  phase  the  circuit  will  not  suffer  from  any  time  delay 
caused  by  division  operations.  The  computational  bottleneck  of  all  three  structures,  52.1,  52.2 
and  52.3,  is  the  boundary  cell,  thus  it  determines  the  operations-per-cycle  metric. 

As  a  conclusion  for  the  RLS  architectures,  we  observe  that  the  figures  on  Tables  1,  2  and  5 
favor  the  architectures  based  on  the  kA  dotation,  k  —  A  =  1  versus  the  ones  that  are  based  on  the 
fu>  rotation  with  /x  =  v  —  1  and  the  standard  Givens  rotation.  This  claim  is  clearly  substantiated 
by  the  delay  times  on  Table  5,  associated  to  the  DSP  implementation  of  the  QRD-RLS  algorithm. 
These  delay  times  are  calculated  on  the  basis  of  the  manufacturers  benchmark  speeds  for  floating 
point  operations  [18].  The  readers  may  have  to  bear  in  mind  that  the  weight  extraction  of  [17]  is 
not  a  good  form  due  to  the  updating  of  i?-1  if  the  weight  vector  at  each  time  instant  is  required. 

4  CRLS  Algorithm  and  Architecture 

The  optimal  weight  vector  w\n)  and  the  optimal  residual  elCRLS{tn)  that  correspond  to  the 
constraint  vector  cl  are  given  by  the  expressions  [15]: 

^(»)  =  TT-vli  nR~\n)z\n)  (29) 

II  2  (n)  II 

zi(n )  ii 

X(tn)R-\n)z'(n).  (31) 

4The  multiplications  with  the  constants  /?,/92,l//?  and  1  / /?2 ,  as  well  as  the  communication  links  that  drive  the 
mode  bit,  are  not  encountered. 


and 


eCHLs(f«) 


where 


CCRLs(tn)  — 
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The  term  zl(n)  is  defined  as  follows 


z\n)  =  R-H(ny 


(32) 


and  it  is  computed  with  the  recursion  [15] 

\ z\n-  !) 

0T 

where  the  symbol  #  denotes  a  term  of  no  interest.  In  this  Section,  we  derive  a  variation  of  the 
recursion  that  is  based  on  the  parametric  kA  dotation.  Then,  we  design  the  systolic  arrays  that 
implement  this  recursion  for  k  =  A  =  1.  We  also  make  a  comparison  of  these  systolic  structures  with 
those  based  on  the  Givens  rotation  and  the  \iv  dotation  introduced  by  Gentleman  [6,  2,  15,  17]. 

From  (32)  and  (21)  we  have  -?'(n)  =  (^T(n)_1/2^(n))  H  c{  and  since  L(n)  is  a  diagonal  real 
valued  matrix  we  get  z\n )  =  X(n)1/2 R(n)~H c\  where  c!  is  the  constraint  direction.  If  we  let 


z\n) 

# 


=  T(n) 


z\n )  =  L(n)R(n)  H cl 


(34) 


we  obtain 

z\n)  =  L(n)~ll2z\n).  (35) 

From  (35)  we  get  ||  zx(n)  ||2=  z'H {n)L~x{n)zl(n).  Also,  from  (21)  and  (35)  we  get  R~1(n)zt(n )  = 
/j-1(n)F(n).  Consequently,  from  (30),  (31)  and  (29)  we  have 


and 


where 


eCRLs(n )  -  -iH(  \T-\(  \~i(  yeRLsi71) 
zl  (n)L  l(n)zx(n) 


”'(n)  =  ?»(n)L->(n)?(n)5',(")?'<’*)’ 


elCRLs(n)  =  X(n)R  V)z’(rc)- 


(36) 


(37) 


(38) 


Because  of  the  similarity  of  (31)  with  (38)  and  (29)  with  (37)  we  are  able  to  use  a  variation  of  the 
systolic  arrays  that  are  based  on  the  Givens  rotation  [15,  17]  in  order  to  evaluate  (36)-(37). 
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4.1  Systolic  Architecture  for  the  Optimal  CRLS  Residual  Evaluation 

From  (26)  and  (36),  if  lq  =  1,  we  get  the  optimal  residual 


eCRLs(n ) 


rl 

z,H(n)i/~1(n)2t(n) 


(il 


Kna 


p^ppAp) 

■  °p+ I' 


^ Pa'pp 


(39) 


In  Fig.  4,  we  present  the  systolic  array  53.1,  that  evaluates  the  optimal  residual  for  Kj  —  A ,•  = 
1  ,j  =  1,2,  •••,£>,  and  the  number  of  constraints  is  N  —  2.  This  systolic  array  is  based  on  the 
design  proposed  by  McWhirter  [15].  It  operates  in  two  modes  and  is  in  a  way  very  similar  to  the 
operation  of  the  systolic  structure  52.1  (see  Section  3).  The  recursive  equations  for  the  delta  of  the 
matrix  R  are  given  in  (22)-(25).  They  are  evaluated  by  the  boundary  cells  (cell  number  1)  and  the 
internal  cells  (cell  number  2).  These  internal  cells  are  identical  to  the  ones  of  the  array  52.1.  The 
boundary  cells  have  a  very  important  difference  from  the  corresponding  ones  of  52.1:  while  they 
operate  in  mode  0,  they  make  use  of  their  division  operators  in  order  to  evaluate  the  elements  of 
the  diagonal  matrix  X-1(n),  i.e.  the  quantities  1  //,-,i  =  1,2 These  quantities  are  needed 
for  the  evaluation  of  the  term  2,H(n)i(n)_12'(n)  in  (39).  The  elements  of  the  vectors  z1  and  z 2 
are  updated  by  a  variation  of  (24)  and  (25),  for  which  the  constant  (3  is  replaced  by  1/(3.  The  two 
columns  of  the  internal  cells  (cell  number  3)  are  responsible  for  these  computations.  They  initialize 
their  memory  value  during  the  second  phase  of  the  initialization  (mode  1)  according  to  (34).  While 
they  operate  in  mode  0,  they  are  responsible  for  evaluating  the  partial  sums 

m  =  £  II  4  II2  /!,-  (40) 

3= 1 


The  output  cells  (cell  number  4)  are  responsible  for  the  final  evaluation  of  the  residual5. 

McWhirter  has  designed  the  systolic  arrays  that  evaluate  the  optimal  residual,  based  on  either 
the  Givens  rotation  or  the  square- root-free  variation  that  was  introduced  by  Gentleman  [15,  2]. 
We  will  call  these  systolic  arrays  53.3  and  53.2  respectively.  On  Tables  3  and  5  we  collect  some 
computational  complexity  metrics  for  the  systolic  arrays  53.1,  53.2  and  53.3,  when  they  operate  in 
mode  0  6.  We  observe  that  the  dotation-based  53.2,  outperforms  the  kA  dotation-based  53.1. 

'Note  the  alias  r*  =  r. 

6The  multiplications  with  the  constants  /3,/32,l//3  and  l//?2,  as  well  as  the  communication  links  that  drive  the 
mode  bit,  are  not  encountered. 
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The  two  structures  require  the  same  number  of  division  operators,  while  53.2  needs  less  multipliers 
and  also  it  has  less  communication  overhead. 

4.2  A  Systolic  Architecture  for  the  Optimal  CRLS  Weight  Vector  Extraction 

In  Fig.  5,  we  present  the  systolic  array  that  evaluates  (37)  for  Kj  =  A j  =  1  ,j  =  1,2,-  •  -,p  and  the 
number  of  constraints  equal  to  TV  =  2.  This  systolic  array  operates  in  two  modes,  just  as  the  arrays 
52.1  and  53.1  do.  The  boundary  cell  (cell  number  1)  is  responsible  for  evaluating  the  diagonal 
elements  of  the  matrices  R  and  T,  the  variable  lq ,  as  well  as  all  the  coefficients  that  will  be  needed 
in  the  computations  of  the  internal  cells.  In  mode  0  its  operation  is  identical  to  the  operation  of  the 
boundary  cell  in  52.1,  while  in  mode  1  it  behaves  like  the  corresponding  cell  of  53.1.  The  internal 
cells  in  the  left  triangular  part  of  the  systolic  structure  (cell  number  2)  evaluate  the  non-diagonal 
elements  of  the  matrix  R  and  they  are  identical  to  the  corresponding  cells  of  53.1.  The  remaining 
part  of  the  systolic  structure  is  a  2-layer  array.  The  cells  in  the  first  column  of  each  layer  (cell 
number  3)  are  responsible  for  the  calculation  of  the  vector  z1  and  the  partial  summations  (40). 
They  also  communicate  their  memory  values  to  their  right  neighbors.  The  latter  (cell  number  4) 
evaluate  the  elements  of  the  matrix  R~n  and  they  are  identical  to  the  corresponding  elements  of 
52.1.  The  output  elements  (cell  number  5)  are  responsible  for  the  normalization  of  the  weight 
vectors  and  they  compute  the  final  result. 

Shepherd  et  al.  [17]  and  Tang  et  al.  [19]  have  designed  systolic  structures  for  the  weight  vector 
extraction  based  on  the  Givens  rotation  and  the  square- root-free  dotation  of  Gentleman  [2].  We 
will  call  these  two  arrays  54.3  and  54.2  respectively.  On  Tables  4  and  5,  we  show  the  computational 
complexity  metrics  for  the  systolic  arrays  54.1,  54.2  and  54.3,  when  they  operate  in  mode  0.  The 
observations  we  make  are  similar  to  the  ones  we  have  for  the  systolic  arrays  that  evaluate  the  RLS 
weight  vector  (see  Section  3). 

Note  that  each  part  of  the  2-layer  structure  computes  the  terms  relevant  to  one  of  the  two 
constraints.  In  the  same  way,  a  problem  with  N  constraints  will  require  an  N- layer  structure. 
With  this  arrangement  of  the  multiple  layers  we  obtain  a  unit  time  delay  between  the  evaluation 
of  the  weight  vectors  for  the  different  constraints.  The  price  we  have  to  pay  is  the  global  wiring  for 
some  of  the  communication  links  of  cell  3.  A  different  approach  can  also  be  considered:  we  may 
place  the  multiple  layers  side  by  side,  one  on  the  right  of  the  other.  In  this  way,  not  only  the  global 
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wiring  will  be  avoided,  but  also  the  number  of  communication  links  of  cell  3,  will  be  considerably 
reduced.  The  price  we  will  pay  with  this  approach  is  a  time  delay  of  p  units  between  consequent 
evaluations  of  the  weight  vectors  for  different  constraints. 

As  a  conclusion  for  the  CRLS  architectures,  we  observe  that  the  figures  on  Tables  3,  4  and  5 
favor  the  architectures  based  on  the  pv  dotation,  p  —  u  —  1  versus  the  ones  that  are  based  on  the 
k\  rotation  with  k  =  A  =  1. 

5  Dynamic  Range,  Stability,  and  Error  Bounds 

Both  the  k\  and  pv  -Rotation  algorithms  enjoy  computational  complexity  advantages  compared 
to  the  standard  Givens  rotation  with  the  cost  of  the  denormalization  of  the  latter.  Consequently, 
the  numerical  stability  of  the  QRD  architectures  based  on  these  algorithms  can  be  questioned. 
Furthermore,  a  crucial  piece  of  information  in  the  circuit  design  is  the  wordlength,  that  is  the 
number  of  bits  per  word  required  to  ensure  correct  operations  of  the  algorithm  without  overflow. 
At  the  same  time,  the  wordlength  has  large  impact  on  the  complexity  and  the  speed  of  the  hardware 
implementation.  In  this  Section,  we  address  issues  on  stability,  error  bounds  and  lower  bounds  for 
the  wordlength  by  means  of  dynamic  range  analysis.  We  focus  on  the  algorithm  for  RLS  optimal 
residual  extraction  based  on  a  kA  Rotation.  The  dynamic  range  of  the  variables  involved  in  the 
other  newly  introduced  algorithms  can  be  computed  in  a  similar  way. 

In  [13],  Liu  et  al.  study  the  dynamic  range  of  the  QRD-RLS  algorithm  that  utilizes  the  standard 
Givens  rotation.  This  study  is  based  on  the  fact  that  the  rotation  parameters  generated  by  the 
boundary  cells  of  the  systolic  QRD-RLS  structure  eventually  reach  a  quasi-steady-state  regardless 
of  the  input  data  statistics,  provided  that  the  forgetting  factor  (3  is  close  to  one.  A  worst  case 
analysis  of  the  steady  state  dynamic  range  reveals  the  bound  [13] 

lim  \rij(n)\  <  ^ = =  \xmax\  =  7%,  j  =  i,i  +  1,  •  •  -,p  +  1  (41) 

for  the  contents  of  the  processing  elements  of  the  row  in  the  systolic  structure,  i  —  1, 2,  •  •  •  ,p, 
where  \xmax\  is  the  largest  value  in  the  input  data.  Similarly,  at  the  steady  state  the  output  of  the 
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jth  row  ^(O  j  —  +  1  is  bounded  by  [13] 


lim 

n— *■  00 


x^\n) 


<  (2 py-1  \xmax\  =  nxt,  j  =  i  + 1, i  +  2,  •  •  •  ,P  + 1. 


(42) 


Furthermore,  the  optimal  residual  eRLS  is  bounded  by  [13] 


lim  \eRLS(n)\  <  (2/?)p  1  \xmax\  =  ft- 


The  latter  is  a  BIBO  stability  result  that  applies  also  for  the  QRD-RLS  algorithm  based  on  a 
kA  dotation.  Nevertheless,  the  internal  stability  is  not  guaranteed.  More  concretely,  the  terms 
involved  in  the  QRD-RLS  algorithm  may  not  be  upper  bounded. 

In  view  of  the  internal  stability  problem,  a  proper  choice  of  the  parameters  k  and  A  should  be 
made.  A  correct  choice  will  compensate  for  the  denormalization  of  the  type 


(43) 


where  /•  and  are  given  in  (22)  and  (23)  respectively.  The  terms  k}  and  Xf  in  (22)  and  (23)  can 
be  used  as  shift  operators  by  choosing 


Ki  —  2  px  and  A;  =  2  T’,  i=  1,2, •••,£,  (44) 

where  pi  and  rt-  take  integer  values.  For  instance,  in  (23),  if  Ti  >  0  the  effect  of  X]  on  (/,'  1^j32a,fi  + 
lib\t~1)  )  will  be  a  right  shift  of  2r,  bits.  We  can  ensure  that 

0.5<Z'<2  and  0.5  <  l®  <  2,  *  =  l,2,---,p  (45) 

by  forcing  the  most  significant  bit  (MSB)  of  the  binary  representation  to  be  either  at  position  2°  or 
2-1  after  the  shift  operation.  This  normalization  task  has  been  introduced  in  [1]  and  further  used 
in  [4].  It  can  be  described  in  analytic  terms  by  the  expression 

shift _amount(unnormalized_quantity)  =  L{log2  (unnormalized_quantity)  +  1}  / 2] 
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and  it  can  be  implemented  very  easily  in  hardware. 

In  the  sequel,  we  consider  the  k\  -Rotation  by  choosing: 


Pi  —  shift-amount  /t7g!  ^(4*  ^/32ct^  +  ^  )| 

Ti  =  shift-amount  (52  a2-  +  )|  , 


(46) 


for  i  =  1,2,  [4].  Note  that  (46)  along  with  (44)  should  precede  (22)-(25)  in  the  rotation 

algorithm.  In  conformity  to  [1]  and  [4]  we  will  refer  to  the  resulting  rotation  algorithm  with  the 
name  scaled  rotation. 

The  systolic  array  that  implements  the  QRD-RLS  algorithm  for  the  optimal  residual  extraction 
is  depicted  in  Fig.  6.  A  comparison  of  this  systolic  array  with  the  one  in  Fig.  2  is  summarized  by 
the  following  points:  The  boundary  cells  generate  the  shift  quantities  p  and  r  associated  with  the 
parameters  k  and  A  respectively,  and  they  communicate  them  horizontally  with  the  internal  cells. 
This  yields  two  additional  links  for  the  boundary  cells  and  four  additional  ones  for  the  internal 
cells.  In  the  dynamic  range  study  that  follows,  we  show  that  the  number  of  bits  these  links  occupy 
is  close  to  the  logarithm  of  the  number  of  bits  required  by  the  rest  of  the  links.  The  boundary  cells 
are  also  responsible  for  computing  the  quantities  Yi^=i  Pan  and  nf=i  >n  (26).  In  this  case,  A,  is 
an  exponential  term  according  to  (44),  so  the  above  product  can  be  computed  as  the  running  sum 
of  the  exponents 

X 

9z  =  J2Tk'  *  =  1,2,  • --,p- 1,  (47) 

k= 1 

yielding  an  additional  adder  for  the  boundary  cells.  Finally,  as  far  as  the  boundary  cells  are 
concerned,  we  observe  that  the  cell  at  position  (p.p)  of  the  systolic  array  is  not  identical  to  the  rest 
of  the  boundary  cells.  This  is  a  direct  consequence  of  (26).  On  the  other  hand,  the  shift  operators 
constitute  the  only  overhead  of  the  internal  and  the  output  cells  compared  with  the  corresponding 
ones  in  Fig.  2.  Overall,  the  computational  complexity  (in  terms  of  operator  counts)  is  slightly 
higher  than  that  of  the  systolic  array  with  k  =  A  =  1. 

Let  us  focus  now  on  the  dynamic  range  of  the  variables  in  the  systolic  array.  By  solving  (43)  for 
a'-  and  using  (41)  and  (45)  one  can  compute  an  upper  bound  for  a,y  at  the  steady  state,  thus  one 
can  specify  the  dynamic  range  of  the  row  cell  content.  A  similar  result  can  be  obtained  for  the 
output  of  the  row  by  using  (42),  (43)  and  (45).  The  results  are  summarized  by  the  following 
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Lemma: 


Lemma  5.1  The  steady  state  dynamic  range  of  the  cell  content  TZ “  and  output  range  TZ1’  in  the  i^1 
row  are  given  by 


lim  \aij(n)\  <  TZ f  =  \f2TZf  and 


lim 

n— kx> 


b{-\n) 


<  TZ\  = 


(48) 


respectively. 

The  lower  bounds  in  the  wordlength  come  as  a  direct  consequence  of  Lemma  5.1:  The  wordlength 

of  the  cell  content  Bf  and  output  B\  in  the  row  must  be  lower  bounded  by 

Bf  >  \Bf  +  0.5]  and  B\  >  \Bf  +  0.5]  (49) 

respectively,  where  Bf  —  [log2  TZf~\  and  Bf  =  [log2  TZf\  are  the  corresponding  wordlength  lower 

bounds  for  the  QRD-RLS  implementation  based  on  the  standard  Givens  rotation. 

The  parameters  k,-,  A;  are  communicated  via  their  exponents  pl  and  T{.  The  dynamic  ranges  of 
these  exponents  are  given  by  Lemma  5.2  which  is  proved  in  the  Appendix. 

Lemma  5.2  The  steady  state  dynamic  range  of  the  terms  pz  and  r,  at  the  i^1  row  TZP{  and  7 Zj  are 
given  by 

limn_.00  \Pl\  <TZPX=  Bf  +  2.5 

A  (5®) 

lim^  |r,|  <  TZ]  =  Bf  +  1.5 

respectively7 . 

Consequently,  the  lower  bounds  on  the  wordlength  Bf  and  BJ  of  p,  and  T{  are 

Bf  >  [log  {Bf  +  2.5)1  and  Bf  >  [log  {Bf  +  1.5)1  (51) 


respectively. 

For  the  computation  of  the  optimal  residual  the  boundary  cells  need  to  evaluate  both  the 
running  product  e;  =  II*=i  Pakk  and  the  running  sum  in  (47).  The  dynamic  ranges  for  these  terms 
are  given  by  the  following  Lemma: 

7For  the  sake  of  simplicity  in  notation  we  have  dropped  the  time  parameter  n  from  the  expression  in  the  limit 
argument. 
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The  implied  wordlength  lower  bounds  are  Bf  >  5“  +  1,  B f  >  B\  +  l,Bf  >  Bf  and  Bf  >  Ir¬ 
respectively. 

In  summary,  (45),  (48),  (50),  (52)  and  (54)  show  that  all  the  internal  parameters  are  bounded 
and  therefore  the  algorithm  is  stable.  Furthermore,  the  lower  bounds  on  the  wordlength  provide 
the  guidelines  for  an  inexpensive,  functionally  correct  realization. 

The  error  bound  of  the  whole  QRD  to  a  given  matrix  A  6  3f?mXn  due  to  floating  point  operations 
is  given  by  [1,  4] 

\\6A\\  <  r(m  +  n  —  3)(1  +  tT+^UW  +  0(e2),  (55) 
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where  r  is  the  upper  bound  and  c  is  the  largest  number  such  that  1  +  e  is  computed  as  1.  If  (44) 
and  (45)  are  satisfied,  for  k  =  A  =  1,  then  it  follows  that  r  =  6.5e  [4],  This  is  fairly  close  to  the 
standard  Givens  rotation  which  has  r  =  6.0c  [4]. 

6  Conclusion 

We  introduced  the  parametric  kA  Rotation ,  which  is  a  square-root-free  and  division-free  algorithm, 
and  showed  that  the  parametric  k\  dotation  describes  a  subset  of  the  pv  dotation  algorithms  [8]. 
We  then  derived  novel  architectures  based  on  the  k\  dotation  for  k  =  A  =  1  and  made  a  com¬ 
parative  study  with  the  standard  Givens  rotation  and  the  pv  dotation  with  p  —  v  —  1.  Finally, 
a  dynamic  range  study  is  pursued.  It  is  observed  that  considerable  improvements  can  be  obtained 
for  the  implementation  of  some  QRD-based  algorithms. 

We  pointed  out  the  trade-offs  between  the  architectures  based  on  the  above  dotations.  Our 
analysis  suggests  the  following  decision  rule  for  selecting  between  the  architectures  that  are  based  on 
the  pv  -Rotation  and  the  kX  Rotation  :  Use  the  pv  Rotation-based  architectures,  with  p  =  v  =  1,  for 
the  constrained  minimization  problems  and  the  kX  Rotation-based  architectures ,  with  k  =  A  =  1,  for 
the  unconstrained  minimization  problems.  Table  5  shows  the  benchmark  comparisons  of  different 
algorithms  using  different  DSP  processors  and  it  confirms  the  properties  claimed  in  this  paper. 

A  number  of  obscure  points  relevant  to  the  realization  of  the  QRD-RLS  and  the  QRD-CRLS 
algorithms  are  clarified.  Some  systolic  structures  that  are  described  in  this  paper  are  very  promising, 
since  they  require  less  computational  complexity  (in  various  aspects)  from  the  structures  known  to 
date  and  they  make  the  VLSI  implementation  easier. 


A  Appendix 


Proof  of  Lemma  3.1: 

First,  we  derive  some  equations  that  will  be  used  in  the  course  of  the  optimal  residual  compu¬ 
tation. 

If  we  solve  (24),  case  i  =  j  =  1,  for  /?/?2a|1  +  and  substitute  in  (22)  we  get 


l[  =  lil 


Q 


2 

1 
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and  therefore 


—  /gfljj/Cj. 

h 


If  we  solve  (24),  case  j  =  i,  for  l[l  ^ /32a^  4-  l{b\l  ^  and  substitute  in  (23)  we  get 


,(i)  _  ^iaii 

q  Ki  ‘ 


If  we  substitute  the  same  expression  in  (22)  we  get 


l'  =  lilWa'aKi. 


In  the  above  expression  we  substitute  4"  ' *  from  (57),  and  solve  for  /•//,•  to  obtain 


7 

h  Ki—1 


If  we  solve  (22)  for  lq  1^f32a2i  4-  ^  and  substitute  in  (23)  we  get 


Also,  we  note  that  (4)  implies  that 


,W  _  1 

9 


Ci  =  firalr'a 


and  by  substituting  (9)  we  obtain 


Similarly,  from  (4)  and  (9),  we  get 


=  l,2,---,p+l. 


The  optimal  residual  for  the  RLS  problem  is  [6] 


^RLsi^n)  —  (  ||  j 
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The  expressions  in  (20)  and  (19)  imply 


v(tn) 


If  we  substitute  the  above  expressions  of  v(tn)  and  ct-  in  (62)  we  obtain 


eRLs(tn)  =  ~  II 


P+i- 


(63) 


From  (59)  we  get 


lM  = 


1  _  K  lP  Kp- 1  lP~  1  ,(p- 2) 

i  \  n  t  »  O  'n 


=  < 


nj=i 

rfc1 


x2 


<2J  A2j-1  l2]-\ 


v2j 

iLl2!^ 


a?. 


_ _ -_1_  ^2  )-l 

4  4  4,-1  4~1 


Kp  h  ^9 


Thus,  from  (63)  and  (64),  for  the  case  of  p  —  2 A:,  we  have 


p  =  2k 
p  =  2k  —  1 


(64) 


CRLsitn )  =  “11 

j=l 


fla2j,2j 
7 ' 

4\2j 


4  @a2j-l,2j-l  l2j-l  f  ^2j  l2 j  k2j-1  4-1 

2 j  a2j-l,2j-l  V  4-1  \K2j  4  ^2j-l  ^2j  — 1 


_ i _ -  6(p) 

A  tp+v 


By  doing  the  appropriate  term  cancellations  and  by  substituting  the  expressions  of  /(//,-,  i  = 
1,2,- • -,2k  from  (56)  and  (58)  we  obtain  the  expression  (26)  for  the  optimal  residual.  Similarly, 
for  the  case  of  p  =  2k  -  1,  from  (63)  and  (64)  we  obtain 


A;—  1 


eRLs(tn )  = 


J-J  |  (3a2j,2j  l2j  Pa2j—\,2j—\  l2j- 1  /  4j-l  4-1  Kh  4 

j= 1  l  a2j,2j  V  l2j  a2j-l,2j-l  '  ‘"2 


^2?  i  o„’  i  V  4-1  VK2i-l  4-1  ^2j  4 


/?a. 


pp 


K 


44?  ^(p) 


1  \\  \2p  P+1 

~pp  V  lP  ^  Ap‘p 

and  by  substituting  (58)  we  get  (26). 

Proof  of  Lemma  3.2: 

The  question  is  whether  we  can  avoid  the  division  in  the  evaluation  of  the  residual.  Obviously 
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Consequently,  at  the  steady  state  we  have 

Jhrn  1 4t_1)/?24  +  Ub{rX)2 1  <  2  (7e“)2  +  2  (ftj)2  . 

Also,  (41),  (42)  and  (48)  imply  that  7 Z?  >  TZ].  Therefore,  we  obtain  the  bound 

lim  +  /t6f-1)2  <4(^)2 

n—+oo  y  ‘ 

and  by  utilizing  (23)  and  the  fact  that  >  0.5  we  get 

lim  A-2  <  2  lim  W~l) P a2i{  +  <8(7 1*)2.  (65) 

n—> oo  n— >00  I  *  ‘ 

By  substituting  the  expression  A,-  =  2T‘,  using  (65)  and  solving  the  resulted  inequality  for  r;  we 
obtain 

lim  |r;|  <  log7£“  +  1.5. 

n— *oo 

The  expression  for  the  dynamic  range  of  r,  in  (50)  is  a  direct  consequence  of  the  above  inequality. 
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Similarly,  for  the  computation  of  the  dynamic  range  of  the  term  px  first  one  can  prove  that 

Bm  |y<0(4i_V4  +  <  is  (K)2 

and  then  compute  an  upper  bound  for  pt  at  the  steady  state  based  on  (22),  (44)  and  the  fact  that 
I'i  >  0.5. 

Proof  of  Lemma  5.3: 

Since  0  <  /3  <  1,  for  the  term  et  we  have 

i  i 

lim  |ei|  =  /3*  JJ  lim  \akk\  <  TT  Tlak. 

n~+  oo  n— ►  oo  AX 

k= 1  k=l 

Similarly,  for  the  term  gi  we  have 

i 

lim  \gt\  =  V  lim  |rfc| 

n— ►  oo  ^ ^  n — ►oo 

k= 1 

and  from  (50) 

Bm  W<  ERJ  =  E(SJ  +  l-5).  (66) 

fc=l  fc  =  l 

(41)  implies  that  the  wordlength  for  the  variable  r  should  satisfy  the  inequality 

Br>r(i-l)(l+log/J)  +  Cl, 

where  C  is  constant  with  respect  to  i.  Since  (i  <  1,  it  is  sufficient  to  have 

>  \i  —  1  +  C~\ , 

or 

+  (67) 

A  similar  formula  can  be  derived  for  the  wordlength  of  the  contents  of  the  the  array  that  utilizes 
the  scaled  rotation,  based  on  (49)  and  (67).  More  specifically,  we  have 

Bf  >B\  +  i-  1. 
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From  this  inequality  and  (66)  we  get 


lim  \eA  <  -j-  ^  ^  +  1.5i. 

n— ►oo  2 

The  dynamic  range  expression  in  (52)  follows  directly. 
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51.1  :  kA 

51.2  :  fj,v 

51.3 

Givens  rotation 

cell 

1 

2 

3 

1 

2 

3 

1 

2 

3 

number  of 

P 

1 

P 

SUSA 

1 

P 

p(p+i) 

2 

1 

sq.rt 

- 

- 

- 

- 

- 

- 

1 

- 

div. 

- 

- 

1 

1 

- 

- 

1 

- 

- 

mult. 

9 

4 

1 

5 

3 

1 

4 

4 

1 

i/o 

9 

10 

4 

6 

8 

3 

5 

6 

3 

Table  1:  RLS  residual  computational  complexity. 


52.1 

:  kX 

52.2 

:  yv 

52.3 

:  Givens  rotation 

cell 

1 

2 

3 

4 

1 

2 

3 

4 

1 

2 

3 

number  of 

P 

KS39I 

P 

KSS9 

P 

P 

(p-i)p 
_ 2 _ 

P 

ESI 

sq.rt 

- 

- 

- 

- 

- 

- 

- 

- 

1 

- 

- 

div. 

- 

- 

- 

- 

1 

- 

- 

- 

1 

- 

- 

- 

mult. 

8 

4 

4 

5 

5 

3 

3 

4 

4 

4 

4 

5 

i/o 

7 

10 

11 

14 

6 

8 

9 

12 

3 

6 

7 

10 

Table  2:  RLS  weight  extraction  computational  complexity  (mode  0). 
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53.1 

kA 

53.2 

pp 

53.3 

:  Givens  rotation 

cell 

1 

3 

4 

1 

2 

3 

4 

1 

3 

4 

number  of 

P 

Np 

N 

n 

151*111 

P 

Np 

N 

sq.rt 

- 

- 

- 

- 

- 

- 

- 

- 

1 

- 

- 

- 

div. 

1 

- 

- 

1 

l 

- 

- 

1 

1 

- 

- 

1 

mult. 

9 

4 

6 

3 

6 

3 

5 

2 

5 

4 

5 

2 

i/o 

10 

12 

14 

7 

7 

10 

12 

5 

5 

6 

8 

5 

Table  3:  CRLS  optimal  residual  computational  complexity  (mode  0). 


54.1 

kA 

54.2 

pp 

cell 

1 

2 

3 

4 

5 

1 

2 

3 

4 

5 

number  of 

P 

(p-^p 

2 

Np 

Wp(p+1) 

2 

Np 

P 

(p-i)p 

2 

Np 

Np(p+ 1) 

2 

Np 

sq.rt 

- 

- 

- 

- 

- 

- 

div. 

1 

- 

- 

- 

1 

1 

- 

- 

- 

1 

mult. 

8 

4 

6 

5 

1 

5 

3 

5 

4 

- 

i/o 

8 

12 

19 

14 

4 

6 

8 

14 

10 

4 

54.3: 

Givens  rotation 

cell 

1 

2 

3 

4 

5 

number  of 

P 

BfKBira 

Np 

sq.rt 

1 

- 

- 

- 

- 

div. 

1 

- 

- 

- 

1 

mult. 

4 

4 

5 

5 

- 

i/o 

4 

8 

13 

10 

4 

Table  4:  CRLS  weight  vector  extraction  comp,  complexity  (mode  0). 


operations-per-cycle 

DSP96000 

(ns) 

IMS  T800 
(ns) 

WEITEK  3164 
(ns) 

ADSP-3201/2 

(ns) 

Sl.l 

max{l  div.  +  1  mult.  ,  9  mult.  } 

900 

3150 

1800 

2700 

SI. 2 

1  div.  +  5  mult. 

1020 

2300 

2700 

3675 

SI. 3 

1  sq.rt.  +  1  div.  +  4  mult. 

1810 

4500 

5300 

7175 

S2.1 

8  mult. 

1600 

BET  ■ 

S2.2 

1  div.  +  5  mult. 

2700 

3675 

S2.3 

1  sq.rt.  +  1  div.  +  4  mult. 

1810 

5300 

7175 

S3.1 

1  div.  +  9  mult. 

1420 

3500 

4875 

S3.2 

1  div.  +  6  mult. 

1120 

2900 

3975 

S3. 3 

1  sq.rt.  +  1  div.  +  5  mult. 

1810 

5300 

7175 

S4.1 

1  div.  +  8  mult. 

1320 

3300 

4575 

S4.2 

1  div.  +  5  mult. 

2700 

3675 

S4.3 

1  sq.rt.  +  1  div.  -f  4  mult. 

1810 

5300 

7175 

Table  5:  Minimum  required  delay  between  two  consequent  sets  of  input  data. 
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Figure  1:  The  relations  among  the  classes  of  algorithms  based  on  QR  decomposition,  a  Rotation  al¬ 
gorithm,  a  fiu  dotation  and  a  k\  dotation. 


x  < —  r 


Figure  2:  51.1  :  Systolic  array  that  computes  the  RLS  optimal  residual.  It  implements  the  algorithm 
that  is  based  on  the  k\  dotation  for  which  k  =  A  =  1. 


mode  1:  b  •  b^-y  *  * 


mode  1 :  if  b  Jn=  1  then  r  y 


Figure  3:  52.1  :  Systolic  array  that  computes  the  RLS  optimal  weight  vector.  It  implements  the 
algorithm  that  is  based  on  the  k\  ^Rotation  for  which  k  =  A  —  1 . 


mode  0  : 


z 


C  •  Z  +  S 

P 


b. 

m 


Tl.  +t' 
in 


mode  1 :  if  b  j  *  1  then  z  «—  y  •  t 


Figure  4:  .S'3.1  :  Systolic  array  that  computes  the  CRLS  optimal  residual.  It  implements  the 
algorithm  that  is  based  on  the  kX  dotations  for  which  k  =  A  =  1. 
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Figure  5:  54.1  :  Systolic  array  that  computes  the  CRLS  optimal  weight  vector.  It  implements  the 
algorithm  that  is  based  on  the  kA  Rotation  for  which  k  =  A  =  1. 
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Figure  6:  Systolic  array  that  computes  the  RLS  optimal  residual  based  on  the  scaled  square  root 
free  and  division  free  dotation. 


